Fix CJKBigramFilter inconsistent positions with outputUnigrams disabled by herley-shaori · Pull Request #15825 · apache/lucene

herley-shaori · 2026-03-14T22:53:56Z

Summary

CJKBigramFilter produces different token positions for the same input depending on whether outputUnigrams is true or false. This breaks phrase queries when index-time and search-time analyzers use different outputUnigrams settings — a common optimization pattern for CJK search.
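
To illustrate why a position mismatch breaks phrase matching, here is a toy exact-phrase matcher. It is a sketch under stated assumptions, not Lucene's PhraseQuery: the class PhrasePositionSketch and the method phraseMatches are hypothetical names, and the positions are the ones this PR description gives for the input "一二、三". It assumes the index was built with outputUnigrams=true and the query was analyzed with outputUnigrams=false.

```java
import java.util.List;
import java.util.Map;

/** Toy exact-phrase matcher (slop 0); hypothetical, not Lucene code. */
final class PhrasePositionSketch {

  /**
   * Returns true if the index contains every query term at the same relative
   * positions that query analysis assigned to them.
   */
  static boolean phraseMatches(
      Map<String, List<Integer>> indexPositions,
      List<String> terms,
      List<Integer> queryPositions) {
    List<Integer> none = List.of();
    for (int start : indexPositions.getOrDefault(terms.get(0), none)) {
      // Align the first query term with this occurrence in the index.
      int shift = start - queryPositions.get(0);
      boolean all = true;
      for (int i = 1; i < terms.size(); i++) {
        List<Integer> hits = indexPositions.getOrDefault(terms.get(i), none);
        if (!hits.contains(queryPositions.get(i) + shift)) {
          all = false;
          break;
        }
      }
      if (all) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Index-time positions with outputUnigrams=true, as listed in this PR.
    Map<String, List<Integer>> index =
        Map.of("一", List.of(0), "一二", List.of(0), "二", List.of(1), "三", List.of(2));
    List<String> query = List.of("一二", "三");
    // Query analyzed with outputUnigrams=false before the fix: 三 at position 1.
    System.out.println(phraseMatches(index, query, List.of(0, 1))); // false
    // Query analyzed with outputUnigrams=false after the fix: 三 at position 2.
    System.out.println(phraseMatches(index, query, List.of(0, 2))); // true
  }
}
```

With the pre-fix query positions the matcher looks for 三 one position after 一二 and finds nothing, because the index holds it two positions later.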

Root cause

In flushBigram(), when outputUnigrams=false, bigrams are emitted with the default positionIncrement=1, but a bigram conceptually spans two character positions. After a word break (punctuation, whitespace, or non-CJK text), subsequent tokens are assigned positions that are off by 1 compared to the outputUnigrams=true case.

Example with input "一二、三":

outputUnigrams=true:  一(pos0) 一二(pos0) 二(pos1) 三(pos2)
outputUnigrams=false: 一二(pos0) 三(pos1) ← should be pos2
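
The mismatch can be reproduced with a toy model of the position arithmetic. The sketch below is not Lucene code: the class BigramPositionSketch and its positions method are hypothetical, it treats every maximal run of Han characters as one CJK segment, and it assumes (as in the example above) that the break character itself consumes no position. It models the behavior before this fix.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of pre-fix token positions; hypothetical, not the CJKBigramFilter implementation. */
final class BigramPositionSketch {

  /** Returns "term@position" for each token produced from the Han runs in text. */
  static List<String> positions(String text, boolean outputUnigrams) {
    List<String> out = new ArrayList<>();
    int pos = -1; // position of the most recently emitted token
    // Assumption: any non-Han character is a segment break that consumes no position.
    for (String run : text.split("[^\\p{IsHan}]+")) {
      int[] cp = run.codePoints().toArray();
      for (int i = 0; i < cp.length; i++) {
        boolean hasNext = i + 1 < cp.length;
        if (outputUnigrams) {
          pos++; // every character advances the position by 1
          out.add(new String(cp, i, 1) + "@" + pos);
          if (hasNext) {
            // The bigram is stacked on the position of its first character.
            out.add(new String(cp, i, 2) + "@" + pos);
          }
        } else if (hasNext) {
          pos++; // pre-fix: each bigram advances the position by only 1
          out.add(new String(cp, i, 2) + "@" + pos);
        } else if (cp.length == 1) {
          pos++; // a lone character is still emitted as a unigram
          out.add(new String(cp, i, 1) + "@" + pos);
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(positions("一二、三", true));  // 三 ends up at position 2
    System.out.println(positions("一二、三", false)); // 三 ends up at position 1
  }
}
```

In the second call the character 二 never advances the position counter on its own, so everything after the break lands one position early.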

Fix

Following the principle suggested by @rmuir — outputUnigrams=false should behave as if unigrams were emitted, then later removed — this PR tracks whether bigrams were emitted from the current CJK segment and defers an extra position increment (+1) to apply to the first token after a segment boundary.

Two new fields in CJKBigramFilter:

hadBigrams: set true when a bigram is flushed in no-unigram mode
deferredPosInc: accumulated extra position increment, applied at the next segment transition (unaligned offsets, non-CJK token, or end of stream)

The deferred increment is applied in flushBigram(), flushUnigram(), and the non-CJK passthrough path in incrementToken().
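
The bookkeeping described above can be sketched in isolation. This is a simplified model under stated assumptions, not the patched filter: the class DeferredIncrementSketch and the method bigramPositions are hypothetical, the input is split into Han runs rather than consumed token by token from a TokenStream, and only the outputUnigrams=false path is modelled. The names hadBigrams and deferredPosInc are borrowed from the description above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the deferred position increment; hypothetical, not the patched filter. */
final class DeferredIncrementSketch {

  /** Returns "term@position" for bigram-only output with the deferred +1 applied. */
  static List<String> bigramPositions(String text) {
    List<String> out = new ArrayList<>();
    int pos = -1;               // position of the most recently emitted token
    boolean hadBigrams = false; // did the current segment emit at least one bigram?
    int deferredPosInc = 0;     // extra increment owed to the next emitted token
    // Assumption: any non-Han character is a segment break that consumes no position.
    for (String run : text.split("[^\\p{IsHan}]+")) {
      if (run.isEmpty()) {
        continue;
      }
      // Segment boundary: the previous segment's last character never got its
      // own position, so defer one extra increment to the next token.
      if (hadBigrams) {
        deferredPosInc++;
        hadBigrams = false;
      }
      int[] cp = run.codePoints().toArray();
      int tokens = Math.max(1, cp.length - 1); // n-1 bigrams, or one lone unigram
      int len = Math.min(2, cp.length);        // bigram length, or 1 for a lone character
      for (int i = 0; i < tokens; i++) {
        pos += 1 + deferredPosInc; // default increment of 1 plus anything deferred
        deferredPosInc = 0;
        out.add(new String(cp, i, len) + "@" + pos);
        if (cp.length > 1) {
          hadBigrams = true;
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(bigramPositions("一二、三"));       // 一二@0, 三@2
    System.out.println(bigramPositions("一二三、四五、六")); // 一二@0, 二三@1, 四五@3, 六@5
  }
}
```

In both inputs every token lands on the position its first character would occupy if unigrams had been emitted and then removed, which is the invariant the fix aims for.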

Changes

CJKBigramFilter.java: Added position tracking logic across CJK segment boundaries
TestCJKBigramFilter.java: Added 3 new test cases reproducing the bug; updated testHanOnly expected positions
TestWithCJKBigramFilter.java (ICU): Updated expected positions in testJa2, testMix, testMix2, testReusableTokenStream, and testFinalOffset
CHANGES.txt: Added bug fix entry

Test plan

All 15 CJKBigramFilter tests pass (including 3 new tests)
All 12 ICU TestWithCJKBigramFilter tests pass
Code formatting verified via ./gradlew tidy
testBigramPositionsConsistentAcrossWordBreak — reproduces exact scenario from issue
testBigramPositionsMultipleSegments — verifies across multiple CJK segments with breaks
testBigramPositionsBeforeNonCJK — verifies CJK bigram followed by non-CJK text

…ams disabled (apache#15812)

When outputUnigrams=false, CJKBigramFilter produced different token positions compared to outputUnigrams=true. A bigram spans two character positions but only advanced the position counter by 1. After a word break (punctuation, whitespace, or non-CJK text), subsequent tokens were assigned incorrect positions, breaking phrase queries in combined unigram+bigram indexing strategies.

The fix tracks whether bigrams were emitted from the current CJK segment and defers an extra position increment (+1) to apply to the first token after a segment boundary. This ensures outputUnigrams=false behaves "as if unigrams were emitted then removed", keeping positions aligned across both settings.

Example: "一二、三"
Before: 一二(pos0) 三(pos1) ← wrong, positions don't match
After: 一二(pos0) 三(pos2) ← correct, matches outputUnigrams=true

rmuir · 2026-03-14T23:03:06Z

Looks great! I think CJKAnalyzer may use this filter, and now its position increments will have changed. Can you glance at the failing tests?

…mFilter behavior

CJKAnalyzer uses CJKBigramFilter with outputUnigrams=false, so its tests need the same position increment updates applied to TestCJKBigramFilter and TestWithCJKBigramFilter: after a CJK bigram segment boundary, the next token now correctly gets positionIncrement=2 instead of 1.

Updates testJa2, testMix, testMix2, testReusableTokenStream, and testFinalOffset.

herley-shaori · 2026-03-16T01:53:32Z

Looks great! I think CJKAnalyzer may use this filter, and now its position increments will have changed. Can you glance at the failing tests?

Done! All tests have passed.

rmuir · 2026-03-16T03:01:46Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/cjk/CJKBigramFilter.java

    termAtt.setLength(len);
    offsetAtt.setOffset(startOffset[index], endOffset[index]);
    typeAtt.setType(SINGLE_TYPE);
+    if (!outputUnigrams && deferredPosInc > 0) {


The !outputUnigrams part of this condition seems unnecessary, since deferredPosInc is never incremented unless !outputUnigrams.

Suggested change

-    if (!outputUnigrams && deferredPosInc > 0) {
+    if (deferredPosInc > 0) {

Thanks for the review! Applied your suggestion and extended the same reasoning to the other guards:

flushUnigram(): removed !outputUnigrams && (your suggestion)

flushBigram(): added an if (deferredPosInc > 0) guard to skip the redundant setPositionIncrement(1) call, since clearAttributes() already resets the position increment to 1

incrementToken() (both segment boundary checks): removed !outputUnigrams && before hadBigrams — since hadBigrams is only ever set true inside the !outputUnigrams branch of flushBigram(), the outer check is redundant.

Also fixed TestCJKAnalyzer (testJa2, testMix, testMix2, testReusableTokenStream, testFinalOffset) — same position increment updates needed since CJKAnalyzer uses CJKBigramFilter with outputUnigrams=false.


hadBigrams is only ever set true inside the !outputUnigrams branch of flushBigram(), so checking !outputUnigrams before testing hadBigrams is redundant. Same reasoning applies to the deferredPosInc guard in flushBigram() — clearAttributes() already defaults posInc to 1, so we only need to call setPositionIncrement when deferredPosInc > 0.

github-actions bot added the module:analysis label Mar 14, 2026

github-actions bot added this to the 11.0.0 milestone Mar 14, 2026

rmuir reviewed Mar 16, 2026


herley added 2 commits March 16, 2026 10:24

Simplify flushUnigram guard: remove redundant !outputUnigrams check

292ed6e

deferredPosInc is only ever incremented when !outputUnigrams, so the extra condition is unnecessary. Suggested by rmuir in review.

herley-shaori wants to merge 4 commits into apache:main from herley-shaori:fix/15812-cjk-bigram-position-inconsistency
